Because both principal component analysis (PCA) and clustering aim to find patterns in a dataset, often with the goal of providing a means to classify, they may appear similar, but they differ in important ways. The principles of tidy data, in which each column in a dataframe represents a variable and each row represents an observation, help illustrate the difference. PCA reduces the variables to a small number of components that still account for much of the variation in the data. We can use the results to make inferences about how the observations are grouped, but those inferences are not part of PCA itself.
In contrast, clustering paradigms seek to group observations such that closely related observations fall within the same group while the separation between groups is large. Both “closely related” and “large separation” are subjective criteria, so any clustering method embeds certain a priori assumptions. Furthermore, the number of subgroups ultimately generated can vary depending on how the analysis is applied.
Hierarchical clustering proceeds in a sequence of steps, leading to a tree-like network (dendrogram) in which all observations are classified. This can be done in one of two ways: agglomerative (bottom-up) clustering or divisive (top-down) clustering. In agglomerative clustering, each observation starts as its own group, and groups are then merged based on similarity. Depending on the measure of distance used to determine similarity or difference, the algorithm then creates second-order groups (groups of groups), then third-order groups, and so on until all observations fit into the dendrogram. This approach allows clustering to be viewed at different hierarchical levels. Divisive clustering starts with all observations in a single group and then divides them into smaller and smaller groups.
An example of a hierarchical cluster is species taxonomy, in which all species are part of a single tree of life, but the number of clusters depends on what level we examine: genus, order, class, phylum, etc.
k-means clustering, a form of partitional clustering, divides the observations into a predetermined number of clusters (k) such that the variation within each cluster is minimized and the variation between the clusters is maximized. No set rule determines the ideal number of clusters, although certain guidelines can be used to optimize it. Unlike hierarchical clustering, partitional clustering creates clusters in a single step based on the distances between observations. The result resembles the output from PCA, but the algorithm evaluates observations, not variables.
k-means clustering uses an iterative process to assign k centroids to the observations, group the data so that each observation is assigned to the closest centroid, then optimize the locations of the centroids to minimize the within-cluster distances. As a result, it can be sensitive to outlier data points. It also tends to create clusters of comparable size. Modified versions of k-means clustering exist that use different approaches to determining distance between observations. For this exercise, however, we will examine how k-means clustering works and can be optimized.
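The iteration described above can be sketched in a few lines of base R. This is an illustrative toy implementation (the function name and defaults are our own, and it does not handle the empty-cluster edge case that production implementations must), not the algorithm kmeans() uses internally:

```r
# Toy sketch of the k-means (Lloyd's) iteration on a numeric matrix x.
kmeans_sketch <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # 1. Choose k observations at random as the initial centroids
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- integer(nrow(x))
  for (iter in seq_len(max_iter)) {
    # 2. Assign each observation to its nearest centroid
    #    (squared Euclidean distance; columns of d correspond to centroids)
    d <- sapply(seq_len(k), function(j) colSums((t(x) - centroids[j, ])^2))
    new_assignment <- max.col(-d, ties.method = "first")
    if (identical(new_assignment, assignment)) break  # assignments stable: converged
    assignment <- new_assignment
    # 3. Move each centroid to the mean of its assigned observations
    for (j in seq_len(k))
      centroids[j, ] <- colMeans(x[assignment == j, , drop = FALSE])
  }
  list(cluster = assignment, centers = centroids)
}
```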
As an example, we can use the mtcars dataset included with base R. The table contains technical data from 1974 on 32 different car models from around the world and is a small but useful dataset for illustrating the principles of k-means clustering.
library(dplyr) # provides glimpse()
glimpse(mtcars)
## Rows: 32
## Columns: 11
## $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
## $ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
## $ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
## $ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
## $ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
## $ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…
As in PCA, we will scale the data using the scale()
function in base R.
mtcars_s <- scale(mtcars)
summary(mtcars_s)
## mpg cyl disp hp
## Min. :-1.6079 Min. :-1.225 Min. :-1.2879 Min. :-1.3810
## 1st Qu.:-0.7741 1st Qu.:-1.225 1st Qu.:-0.8867 1st Qu.:-0.7320
## Median :-0.1478 Median :-0.105 Median :-0.2777 Median :-0.3455
## Mean : 0.0000 Mean : 0.000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4495 3rd Qu.: 1.015 3rd Qu.: 0.7688 3rd Qu.: 0.4859
## Max. : 2.2913 Max. : 1.015 Max. : 1.9468 Max. : 2.7466
## drat wt qsec vs
## Min. :-1.5646 Min. :-1.7418 Min. :-1.87401 Min. :-0.868
## 1st Qu.:-0.9661 1st Qu.:-0.6500 1st Qu.:-0.53513 1st Qu.:-0.868
## Median : 0.1841 Median : 0.1101 Median :-0.07765 Median :-0.868
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.000
## 3rd Qu.: 0.6049 3rd Qu.: 0.4014 3rd Qu.: 0.58830 3rd Qu.: 1.116
## Max. : 2.4939 Max. : 2.2553 Max. : 2.82675 Max. : 1.116
## am gear carb
## Min. :-0.8141 Min. :-0.9318 Min. :-1.1222
## 1st Qu.:-0.8141 1st Qu.:-0.9318 1st Qu.:-0.5030
## Median :-0.8141 Median : 0.4236 Median :-0.5030
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.1899 3rd Qu.: 0.4236 3rd Qu.: 0.7352
## Max. : 1.1899 Max. : 1.7789 Max. : 3.2117
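As a quick check on what scale() does: each column is centered to mean 0 and divided by its standard deviation, i.e. z = (x − mean(x)) / sd(x). We can verify this by hand for the mpg column (mtcars_s is recomputed here so the chunk stands alone):

```r
mtcars_s <- scale(mtcars)                       # as above
z_mpg <- (mtcars$mpg - mean(mtcars$mpg)) / sd(mtcars$mpg)
all.equal(unname(mtcars_s[, "mpg"]), z_mpg)     # TRUE: identical up to tolerance
```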
Once the data are scaled, we want to determine the distance between
observations. There are many ways to measure distance between the rows of
a matrix; for the purposes of this exercise, we will use the Euclidean method.
get_dist() is a function in the factoextra
package.
library(factoextra) # provides get_dist() and the fviz_*() functions
res_dist <- get_dist(mtcars_s, method = "euclidean")
fviz_dist(res_dist, lab_size = 8)
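Euclidean distance is simply the square root of the summed squared differences across all eleven scaled variables. As a sanity check (mtcars_s is recomputed so the chunk stands alone), the hand computation for the first two cars matches base R's dist(), which agrees with get_dist() for this method:

```r
mtcars_s <- scale(mtcars)                                 # as above
a <- mtcars_s["Mazda RX4", ]
b <- mtcars_s["Mazda RX4 Wag", ]
sqrt(sum((a - b)^2))                                      # by hand
as.matrix(dist(mtcars_s))["Mazda RX4", "Mazda RX4 Wag"]   # via dist()
```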
The eclust() function, also in the
factoextra package, incorporates ggplot2 for visualization.
It allows several different clustering paradigms; we will use k-means.
The nstart argument sets the number of random starting configurations the
algorithm tries; the run with the lowest total within-cluster variation is kept.
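To see why multiple starts matter, base R's kmeans() can be run from a single random configuration and from 25. With the same seed, the 25-start fit can never have a worse total within-cluster sum of squares, because its first start replicates the single-start run (the seed and k = 4 here are illustrative choices):

```r
mtcars_s <- scale(mtcars)          # as above
set.seed(1)                        # illustrative seed
fit1  <- kmeans(mtcars_s, centers = 4, nstart = 1)
set.seed(1)
fit25 <- kmeans(mtcars_s, centers = 4, nstart = 25)
c(single = fit1$tot.withinss, best_of_25 = fit25$tot.withinss)
```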
res_kmeans <- eclust(mtcars_s, "kmeans", k.max = 15, nstart = 25)
The function returned 14 clusters. Is this the ideal number? Choosing k
involves some subjective judgment. To get an idea of how different numbers
of clusters lead to different interpretations of the data, we can assign
different values to k and see how the groupings change. This can be
accomplished using a for loop that applies values of k between 2 and 15.
for (i in 2:15) {eclust(mtcars_s, "kmeans", k = i, nstart = 25)}
The fviz_gap_stat() function can offer an idea of the
optimal number of clusters to use. A similar value is stored in the
nbclust element of the object generated by
eclust().
fviz_gap_stat(res_kmeans$gap_stat)
res_kmeans$nbclust
## [1] 14
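For reference, the gap statistic behind fviz_gap_stat() compares the within-cluster dispersion at each k against what would be expected under a null (uniform) reference distribution; eclust() computes it with clusGap() from the cluster package. A standalone sketch, in which the seed, K.max, B, and the "firstSEmax" selection rule are illustrative choices:

```r
library(cluster)                 # clusGap() and maxSE() (a recommended package)
set.seed(123)                    # illustrative seed
gap <- clusGap(scale(mtcars), FUN = kmeans, nstart = 25, K.max = 10, B = 50)
# Select k with the "firstSEmax" rule: the smallest k whose gap is within
# one simulation standard error of the first local maximum
maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")
```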
fviz_silhouette(res_kmeans)
## cluster size ave.sil.width
## 1 1 1 0.00
## 2 2 3 0.46
## 3 3 2 0.47
## 4 4 2 0.03
## 5 5 2 0.64
## 6 6 3 0.68
## 7 7 2 0.85
## 8 8 1 0.00
## 9 9 4 0.20
## 10 10 1 0.00
## 11 11 3 0.22
## 12 12 2 0.48
## 13 13 3 0.63
## 14 14 3 0.42
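The ave.sil.width column above is the average silhouette width per cluster: for each observation the silhouette is (b − a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is its mean distance to the nearest other cluster, so values near 1 indicate well-separated observations. It can be computed directly with silhouette() from the cluster package (k = 3 and the seed here are illustrative choices):

```r
library(cluster)                   # provides silhouette()
mtcars_s <- scale(mtcars)          # as above
set.seed(123)                      # illustrative seed
km  <- kmeans(mtcars_s, centers = 3, nstart = 25)
sil <- silhouette(km$cluster, dist(mtcars_s))
summary(sil)$avg.width             # overall mean silhouette width
```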
For comparison, we can cluster the same data hierarchically:
res_hc <- mtcars %>%
scale() %>% # Scale the data
dist(method = "euclidean") %>% # Compute dissimilarity matrix
hclust(method = "ward.D2") # Compute hierarchical clustering
fviz_dend(res_hc, k = 10, # Use 10 groups
cex = 0.5, # label size
color_labels_by_k = TRUE, # color labels by groups
rect = TRUE # Add rectangles around groups
)
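Group memberships can be extracted from the dendrogram at any chosen number of groups with base R's cutree(); the pipeline is recomputed here so the chunk stands alone, and k = 10 matches the fviz_dend() call above:

```r
res_hc <- hclust(dist(scale(mtcars)), method = "ward.D2")  # as above
grp <- cutree(res_hc, k = 10)   # cut the tree into 10 groups
table(grp)                      # number of cars in each group
head(grp)                       # named vector: car model -> group number
```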
The mtcars analysis can be repeated using only the columns mpg through qsec.
mtcars_s_lim <- mtcars %>% select(mpg:qsec) %>% scale()
res_dist_lim <- get_dist(mtcars_s_lim, method = "euclidean")
fviz_dist(res_dist_lim, lab_size = 8)
res_lim_kmeans <- eclust(mtcars_s_lim, "kmeans", k.max = 15, nstart = 25)
fviz_gap_stat(res_lim_kmeans$gap_stat)
for (i in 2:15) {eclust(mtcars_s_lim, "kmeans", k = i, nstart = 25)}
library(tidymodels)
set.seed(27) # ensure a reproducible example
centers <- tibble(
cluster = factor(1:3),
num_points = c(100, 150, 50), # number of points in each cluster
x1 = c(5, 0, -3), # x1 coordinate of cluster center
x2 = c(-1, 1, -2) # x2 coordinate of cluster center
)
labelled_points <-
centers %>%
mutate(
x = map2(num_points, x1, rnorm),
y = map2(num_points, x2, rnorm)
) %>%
select(-num_points, -x1, -x2) %>%
unnest(cols = c(x, y))
ggplot(labelled_points, aes(x, y, color = cluster)) +
geom_point(alpha = 0.3)
points <-
labelled_points %>%
select(-cluster)
kclust <- kmeans(points, centers = 3)
kclust
## K-means clustering with 3 clusters of sizes 148, 51, 101
##
## Cluster means:
## x y
## 1 0.08853475 1.045461
## 2 -3.14292460 -2.000043
## 3 5.00401249 -1.045811
##
## Clustering vector:
## [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [38] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [75] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [223] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2
## [260] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [297] 2 2 2 2
##
## Within cluster sum of squares by cluster:
## [1] 298.9415 108.8112 243.2092
## (between_SS / total_SS = 82.5 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
summary(kclust)
## Length Class Mode
## cluster 300 -none- numeric
## centers 6 -none- numeric
## totss 1 -none- numeric
## withinss 3 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 3 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric
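The broom package (attached by tidymodels) offers tidier views of this object: tidy() gives one row per cluster, glance() one row per model, and augment() attaches a .cluster column to the original points. A sketch using freshly simulated stand-in data so the chunk runs on its own:

```r
library(tidymodels)   # attaches broom (tidy/glance/augment) and tibble
set.seed(27)
points <- tibble(x = rnorm(300), y = rnorm(300))   # stand-in for the points above
kclust <- kmeans(points, centers = 3)
tidy(kclust)                    # per-cluster centers, sizes, and within-cluster SS
glance(kclust)                  # model-level totss, tot.withinss, betweenss, iter
head(augment(kclust, points))   # original points plus a .cluster factor
```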